Abstract

This paper presents multi-conditional learning (MCL), a training criterion based on a product of multiple conditional likelihoods. When combining the traditional conditional probability of “label given input” with a generative probability of “input given label,” the latter acts as a surprisingly effective regularizer. When applied to models with latent variables, MCL combines the structure-discovery capabilities of generative topic models, such as latent Dirichlet allocation and the exponential family harmonium, with the accuracy and robustness of discriminative classifiers, such as logistic regression and conditional random fields. We present results on several standard text data sets showing significant reductions in classification error due to MCL regularization, and substantial gains in precision and recall due to the latent structure discovered under MCL.


Introduction 

Conditional-probability training, in the form of maximum entropy classifiers (Berger et al., 1996) and conditional random fields (CRFs) (Lafferty et al., 2001; Sutton & McCallum, 2006), has had a dramatic and growing impact on natural language processing, information retrieval, computer vision, bioinformatics, and other related fields. However, discriminative models tend to overfit the training data, and a prior on parameters typically provides limited relief. In fact, it has been shown that in some cases generative naive Bayes classifiers provide higher accuracy than conditional maximum entropy classifiers (Ng & Jordan, 2002). We thus consider alternative training criteria with reduced reliance on parameter priors, which also combine generative and discriminative learning.

This paper presents multi-conditional learning, a family of parameter estimation objective functions based on a product of multiple conditional likelihoods. In one configuration of this approach, the objective function is the (weighted) product of the “discriminative” probability of label given input, and the “generative” probability of the input given label. The former aims to find a good decision boundary, the latter aims to model the density of the input, and the single set of parameters in our naive-Bayes-structured model thus strives for both. All regularizers provide some additional constraints on parameter estimation. Our experimental results on a variety of standard text data sets show that this density-estimation constraint is a more effective regularizer than “shrinkage toward zero,” which is the basis of traditional regularizers such as the Gaussian prior, reducing error by nearly 50% in some cases. As well as improving accuracy, the inclusion of a density estimation criterion helps improve confidence prediction.

In addition to simple conditional models, there has been growing interest in conditionally-trained models with latent variables (Jebara & Pentland, 1998; McCallum et al., 2005; Quattoni et al., 2004). Simultaneously there is immense interest in generative “topic models,” such as latent Dirichlet allocation and its progeny, as well as their undirected analogues, including the harmonium models (Welling et al., 2005; Xing et al., 2005; Smolensky, 1986).

In this paper we also demonstrate multi-conditional learning applied to latent-variable models. MCL discovers a latent space projection that captures not only the co-occurrence of features in input (as in generative models), but also provides the ability to accurately predict designated outputs (as in discriminative models). We find that MCL is more robust than the conditional criterion alone, while also being more purposeful than generative latent variable models. On the document retrieval task introduced in Welling et al. (2005), we find that MCL more than doubles precision and recall in comparison with the generative harmonium.

In latent variable models, MCL can be seen as a form of semi-supervised clustering, with the flexibility to operate on relational, structured, CRF-like models in a principled way. MCL here aims to combine the strengths of CRFs (handling auto-correlation and non-independent input features in making predictions) with the strengths of topic models (discovering co-occurrence patterns and useful latent projections). This paper sets the stage for various interesting future work in multi-conditional learning. Many configurations of multi-conditional learning are possible, including ones with more than two conditional probabilities. For example, transfer learning could naturally be configured as the product of conditional probabilities for the labels of each task, with some latent variables and parameters shared. Semi-supervised learning could be configured as the product of conditional probabilities for predicting the label, as well as predicting each input given the others. These configurations are the subject of ongoing work.

Multi-Conditional Learning and MRFs

In the following exposition we first present the general framework of multi-conditional learning. We then derive the equations used for multi-conditional learning in several structured Markov Random Field (MRF) models. We introduce discrete hidden (sub-class) variables into naive MRF models, creating multi-conditional mixtures, and discuss how multi-conditional methods are derived. We then construct binary word occurrence models coupled with hidden continuous variables, as in the exponential family harmonium, demonstrating the advantages of multi-conditional learning for these models also.

The MCL Framework

Consider a data set consisting of $i = 1, \ldots, N$ instances. We will construct probabilistic models consisting of discrete observed random variables $\{x\}$, discrete hidden variables $\{z\}$ and continuous hidden variables $\mathbf{z}$. Denote an outcome of a random variable as $\tilde{x}$. Define $j = 1, \ldots, N_s$ pairs of disjoint subsets of observations $\{x_A\}_{ij}$ and $\{x_B\}_{ij}$, where our indices denote the $i$th instance of the variables in subset $j$. We will construct a multi-conditional objective by taking the product of different conditional probabilities involving these subsets and we will use $\alpha_j$ to weight the contributions of the different conditionals. Using these definitions the optimal parameter settings under our multi-conditional criterion are given by

$$\arg\max_{\theta} \prod_{i,j} \Big[ \sum_{\{z\}_{ij}} \int P(\{x_A\}_{ij}, \{z\}_{ij}, \mathbf{z}_{ij} \mid \{x_B\}_{ij}; \theta) \, d\mathbf{z}_{ij} \Big]^{\alpha_j}, \qquad (1)$$

where we derive these marginal conditional likelihoods from a single underlying joint probability model with parameters $\theta$. Our underlying joint probability model may itself be normalized locally, globally or using some combination of the two.

For the experiments in this paper we will partition observed variables into a set of “labels” $y$ and a set of “features” $x$. We define two pairs of subsets: $\{x_A, x_B\}_1 = \{y, x\}$ and $\{x_A, x_B\}_2 = \{x, y\}$. We then construct multi-conditional objective functions $\mathcal{L}_{MC}$ with the following form

$$\mathcal{L}_{MC} = \log\big(P(y|x)^{\alpha} P(x|y)^{\beta}\big) = \alpha \mathcal{L}_{y|x}(\theta) + \beta \mathcal{L}_{x|y}(\theta). \qquad (2)$$

In this configuration one can think of our objective as having a generative component $P(x|y)$ and a discriminative component $P(y|x)$. Another attractive definition using two pairs is: $\{x_A, x_B\}_1 = \{y, x\}$ and $\{x_A, x_B\}_2 = \{x, \emptyset\}$, giving rise to objectives of the following form

$$\mathcal{L} = \log\big(P(y|x)^{\alpha} P(x)^{\beta}\big), \qquad (3)$$

which represents a way of restructuring a joint likelihood to concentrate modeling power on a conditional distribution of interest. This objective is similar to the approach advocated in Minka (2005).

Naive MRFs for Documents

The graphical descriptions of the naive Bayes model for text documents (Nigam et al., 2000) and the multinomial logistic regression or maximum entropy (Berger et al., 1996) model can be written with similar naive graphical structures. Here we consider naive MRFs, which can also be represented by a similar graphical structure but define a joint distribution in terms of unnormalized potential functions.

Consider data $\mathcal{D} = \{(y_n, x_{j,n});\ n = 1 \ldots N,\ j = 1 \ldots M_n\}$ where there are $N$ instances and within each instance there are $M_n$ realizations of discrete random variables $\{x\}$. We will use $y_n$ to denote a single discrete random variable for a class label. Model parameters are denoted by $\theta$. For a collection of $N$ documents we thus have $M_n$ word events for each document. The joint distribution of the data can be modeled using a set of naive MRFs, one for each observation such that

$$P(x_1, \ldots, x_{M_n}, y \mid \theta) = \frac{1}{Z}\, \phi(y|\theta_y) \prod_{j=1}^{M_n} \phi(x_j, y|\theta_{x,y}), \qquad (4)$$

where

$$Z = \sum_{y} \sum_{x_1} \cdots \sum_{x_{M_n}} \phi(y|\theta_y) \prod_{j=1}^{M_n} \phi(x_j, y|\theta_{x,y}). \qquad (5)$$

Figure 1: (Left) A factor graph (Kschischang et al., 2001) for a naive MRF. (Right) A factor graph for a mixture of naive MRFs. In these models each word occurrence is a draw from a discrete random variable; there are $M_n$ random variables (word events) in document $n$.

If we define potential functions $\phi(\cdot)$ to consist of exponentiated linear functions of multinomial variables (sparse vectors with a single 1 in one of the dimensions), $y$ for labels and $w_j$ for each word, a naive MRF can be written as

$$P(y, \{w\}) = \frac{1}{Z} \exp\Big( y^T \theta_y + \sum_{j=1}^{M_n} y^T \theta_{x,y}^T w_j \Big). \qquad (6)$$

To simplify our presentation, consider now combining our multinomial word variables $\{w\}$ such that $x = [\sum_{j=1}^{M_n} w_j;\ 1]$. One can also combine $\theta_y$ and $\theta_{x,y}$ into $\theta$ such that

$$P(y, x) = \frac{1}{Z} \exp(y^T \theta^T x). \qquad (7)$$

Under this model, to optimize $\mathcal{L}_{MC}$ from (2) we have

$$P(y|x) = \frac{\exp(y^T \theta^T x)}{\sum_{y'} \exp(y'^T \theta^T x)} \quad \text{and} \quad P(x|y) = \frac{\exp(y^T \theta^T x)}{Z(y)}, \qquad (8)$$

where

$$Z(y) = \sum_{w_1} \cdots \sum_{w_{M_n}} \Big[ \prod_{j=1}^{M_n} \exp(y^T \theta_{x,y}^T w_j) \Big] \exp(y^T \theta_y). \qquad (9)$$

The gradients of the log conditional likelihoods contained in our objective can then be computed using:

$$\nabla \mathcal{L}_{y|x}(\theta) = \sum_{n=1}^{N} \Big( x_n y_n^T - \frac{\sum_{y} \exp(y^T \theta^T x_n)\, x_n y^T}{\sum_{y'} \exp(y'^T \theta^T x_n)} \Big) = N \Big( \langle x y^T \rangle_{\tilde{P}(x,y)} - \big\langle \langle x y^T \rangle_{P(y|x)} \big\rangle_{\tilde{P}(x)} \Big), \qquad (10)$$

where $\langle \cdot \rangle_{\tilde{P}(x)}$ denotes the expectation with respect to distribution $\tilde{P}(x)$ and we use $\tilde{P}(x)$ to denote the empirical distribution of the data, the distribution obtained by placing a delta function on each data point and normalizing by $N$. To compute $\nabla \mathcal{L}_{x|y}(\theta_{x,y})$, we observe that

$$P(x|y) = \prod_{j=1}^{M_n} P(w_j|y) = \prod_{j=1}^{M_n} \Big( \frac{\exp(y^T \theta_{x,y}^T w_j)}{\sum_{w_j'} \exp(y^T \theta_{x,y}^T w_j')} \Big), \qquad (11)$$

and therefore

$$\nabla \mathcal{L}_{x|y}(\theta_{x,y}) = \sum_{n=1}^{N} \sum_{j=1}^{M_n} \Big( w_{j,n} y_n^T - \sum_{w_j} w_j y_n^T P(w_j|y_n) \Big). \qquad (12)$$
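
As an illustration of how the two terms of the objective (2) and their gradients fit together for a naive MRF, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes documents are given as word-count vectors, stores the word/class weights as a nested list `theta[v][c]` with a separate class bias `bias[c]`, and returns only the gradient with respect to the word/class weights; all function and variable names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mcl_objective_and_grad(docs, labels, theta, bias, alpha, beta):
    """Return (objective, gradient w.r.t. theta) for a naive MRF.

    docs[n][v]  : count of vocabulary word v in document n
    labels[n]   : class index of document n
    theta[v][c] : word/class weight; bias[c] is the class weight
    """
    V, C = len(theta), len(bias)
    objective = 0.0
    grad = [[0.0] * C for _ in range(V)]
    # log P(v | c): per-class softmax over the vocabulary, as in (11)
    log_p_word = []
    for c in range(C):
        norm = log_sum_exp([theta[v][c] for v in range(V)])
        log_p_word.append([theta[v][c] - norm for v in range(V)])
    for x, y in zip(docs, labels):
        # Discriminative term: log P(y | x), as in (8)
        scores = [bias[c] + sum(x[v] * theta[v][c] for v in range(V))
                  for c in range(C)]
        log_norm = log_sum_exp(scores)
        objective += alpha * (scores[y] - log_norm)
        # Generative term: log P(x | y) as a product over word events
        objective += beta * sum(x[v] * log_p_word[y][v] for v in range(V))
        m_n = sum(x)  # number of word events in this document
        for v in range(V):
            for c in range(C):
                p_c = math.exp(scores[c] - log_norm)
                # observed minus expected counts under P(y | x), as in (10)
                grad[v][c] += alpha * x[v] * ((c == y) - p_c)
            # observed minus expected counts under P(w | y), as in (12)
            grad[v][y] += beta * (x[v] - m_n * math.exp(log_p_word[y][v]))
    return objective, grad
```

Setting `beta = 0` recovers the purely conditional (maximum entropy) objective, and setting `alpha = 0` recovers the class-conditional likelihood; intermediate settings give the weighted combination discussed above.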


Mixtures  of  Naive  MRFs 

We can extend the basic naive MRF model shown in Figure 1 (Left) by adding a hidden subclass variable as illustrated in Figure 1 (Right). In a mixture of naive MRFs the joint distribution of the data for each observation can be modeled using

$$P(\{x\}, y, z \mid \theta) = \frac{1}{Z}\, \phi(y|\theta_y)\, \phi(y, z|\theta_{y,z}) \prod_{j=1}^{M_n} \phi(x_j, z|\theta_{x,z}), \qquad (13)$$

where the $\phi(y, z|\theta_{y,z})$ potential encodes a sparse compatibility function relating labels or classes to a subset of states of the hidden discrete variable $z$.

To optimize a mixture of naive MRFs, we use the expected gradient algorithm (Salakhutdinov et al., 2003). In this model we can compute the gradient of the complete log likelihood, and this gradient decomposes with respect to our expectation such that the following computation can be efficiently performed:

$$\nabla \mathcal{L}_{x|y}(\theta) = \frac{\partial}{\partial\theta} \ln P(\{x\}|y; \theta) = \sum_{z} P(z|\{x\}, y; \theta)\, \frac{\partial}{\partial\theta} \ln P(\{x\}, z|y; \theta). \qquad (14)$$
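
The identity in (14), that the gradient of a log marginal equals the posterior-weighted gradient of the complete log likelihood, can be checked numerically on a toy model. The sketch below is an illustration only: it assumes a two-component, locally normalized mixture over a single observed word with fixed mixing weights, which is simpler than the globally normalized mixture of naive MRFs in (13), and all names are hypothetical.

```python
import math


def softmax(logits):
    """Normalize a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_marginal(word, theta, prior):
    """ln P(x) = ln sum_z prior[z] * P(x | z) for one observed word."""
    return math.log(sum(prior[z] * softmax(theta[z])[word]
                        for z in range(len(prior))))


def expected_gradient(word, theta, prior):
    """Gradient of ln P(x) via sum_z P(z | x) * d/dtheta ln P(x, z)."""
    cond = [softmax(row) for row in theta]
    joint = [prior[z] * cond[z][word] for z in range(len(prior))]
    evidence = sum(joint)
    grad = []
    for z, row in enumerate(theta):
        posterior = joint[z] / evidence  # P(z | x)
        # d/dtheta[z][v] ln P(x, z) = 1[v == word] - P(v | z)
        grad.append([posterior * ((v == word) - cond[z][v])
                     for v in range(len(row))])
    return grad
```

Comparing `expected_gradient` against central finite differences of `log_marginal` gives agreement to numerical precision, which is the property the expected gradient algorithm relies on.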


For example, the gradient for the “weights” $\lambda_{x_e,z_s}$ comprising the elements of the potential function parameters $\theta_{x,z}$ is computed from

$$\frac{\partial \mathcal{L}_{x|y}(\theta)}{\partial \lambda_{x_e,z_s}} = \sum_{n=1}^{N} \sum_{j=1}^{M_n} \Big[ \sum_{z_n} P(z_n|\{x\}_n, y_n; \theta)\, f_{x_e,z_s}(x_{j,n}, z_n) - \sum_{z_n} \sum_{\{x\}_n} P(\{x\}_n, z_n|y_n; \theta)\, f_{x_e,z_s}(x_{j,n}, z_n) \Big], \qquad (15)$$

where $f_{x_e,z_s}(x, z)$ are binary feature functions evaluating to one when the state of $x = x_e$ and the state of $z = z_s$. The updates for the potential function parameters using $\mathcal{L}_{y|x}$ take a form similar to the standard “maximum entropy” gradient computations, augmented with a hidden variable. We term mixture models trained by multi-conditional learning multi-conditional mixtures (MCM).


Harmonium  Structured  Models 

A harmonium model (Smolensky, 1986) is a two layer Markov Random Field (MRF) consisting of observed variables and hidden variables. Like all MRFs, the model we present here will be defined in terms of a globally normalized product of (unnormalized) potential functions defined upon subsets of variables. A harmonium can also be described as a type of restricted Boltzmann machine (Hinton, 2002). In the following we present a new type of exponential family multi-attribute harmonium, extending the models used in Welling et al. (2005) and the dual-wing harmonium work of Xing et al. (2005).

Our exponential family harmonium structured model can be written as

$$P(\mathbf{x}, \mathbf{z}|\Theta) = \exp\Big\{ \sum_{i} \theta_i^T f_i(x_i) + \sum_{j} \theta_j^T f_j(z_j) + \sum_{i} \sum_{j} \theta_{ij}^T f_{ij}(x_i, z_j) - A(\Theta) \Big\}, \qquad (16)$$

where $\mathbf{z}$ is a vector of continuous valued hidden variables, $\mathbf{x}$ is a vector of observations, $\theta_i$ represents parameter vectors (or weights), $\theta_{ij}$ represents a parameter vector on a cross product of states, $f$ denotes feature functions, $\Theta = \{\theta_{ij}, \theta_i, \theta_j\}$ is the set of all parameters and $A$ is the log-partition function or normalization constant. A harmonium model factorizes the third term of (16) into $\theta_{ij}^T f_{ij}(x_i, z_j) = f_i(x_i)^T W_{ij}^T f_j(z_j)$, where $W_{ij}^T$ is a parameter matrix with dimensions $a \times b$, i.e., with rows equal to the number of states of $f_i(x_i)$ and columns equal to the number of states of $f_j(z_j)$. In the models we construct here we will use binary word occurrence vectors that have dimension $M_v$, the size of our vocabulary. This is in contrast to our models in the previous section where we had a different number of discrete word events $M_n$ for each document $n$. We will denote one of the observed input variables $x_d$ as a discrete label denoted as $y$ in Figure 2.

Figure 2 illustrates a multi-attribute harmonium model as a factor graph. A harmonium represents the factorization of a joint distribution for observed and hidden variables using a globally normalized product of local functions. In our experiments here we shall use the harmonium’s factorization structure to define an MRF and we will then define sets of marginal conditional distributions of some observed variables given others that are of particular interest so as to form our multi-conditional objective.

Figure 2: A factor graph for a multi-attribute harmonium model or two layer MRF.

Importantly, using a globally normalized joint distribution with this construction it is also possible to derive two consistent conditional models, one for hidden variables given observed variables and one for observed variables given hidden variables (Welling et al., 2005). The conditional distributions defined by these models can also be used to implement sampling schemes for various probabilities in the underlying joint model. However, it is important to remember that the original model parameterization is not defined in terms of these conditional distributions. In our experiments below we use a joint model with a form defined by (16) with $W^T = [W_b^T\ W_d^T]$ such that the (exponential family) conditional distributions consistent with the joint model are

$$P(\mathbf{z}_n|\mathbf{x}) = \mathcal{N}(\mathbf{z}_n; \hat{\mu}, I), \quad \hat{\mu} = \mu + W^T \mathbf{x} \qquad (17)$$

$$P(\mathbf{x}_b|\mathbf{z}) = \mathcal{B}(\mathbf{x}_b; \hat{\theta}_b), \quad \hat{\theta}_b = \theta_b + W_b \mathbf{z} \qquad (18)$$

$$P(x_d|\mathbf{z}) = \mathcal{D}(x_d; \hat{\theta}_d), \quad \hat{\theta}_d = \theta_d + W_d \mathbf{z}, \qquad (19)$$

where $\mathcal{N}(\cdot)$, $\mathcal{B}(\cdot)$ and $\mathcal{D}(\cdot)$ represent Normal, Bernoulli and Discrete distributions respectively. The following equation can be used to represent the marginal distribution of $\mathbf{x}$,

$$P(\mathbf{x}|\theta, \Lambda) = \exp\{\theta^T \mathbf{x} + \mathbf{x}^T \Lambda \mathbf{x} - A(\theta, \Lambda)\}, \qquad (20)$$

where $\Lambda = \frac{1}{2} W W^T$ and $\theta$ combines $\theta_d$ and $\theta_b$. The labels for this model are the discrete random variable (i.e. $y = x_d$) and the features are the binary variables.
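
The conditionals (17) and (18) suggest a simple block Gibbs sweep: sample the hidden vector given the observations, then resample the binary observations given the hidden vector. The following is a minimal sketch under stated assumptions, not the authors' code. It handles only the binary wing (the discrete label wing of (19) is omitted), assumes the Bernoulli natural parameter maps to a probability through the logistic function, and uses hypothetical names and a nested-list layout for the weight matrix.

```python
import math
import random


def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def gibbs_step(x, W, mu, theta_b, rng):
    """One block-Gibbs sweep: sample z | x, then x' | z.

    x       : list of 0/1 word occurrences (length V)
    W       : V x K weight matrix as nested lists (assumed layout)
    mu      : length-K offset for the hidden units
    theta_b : length-V bias for the binary visible units
    rng     : a random.Random instance
    """
    V, K = len(W), len(mu)
    # z | x ~ Normal(mu + W^T x, I), as in (17)
    z = [rng.gauss(mu[k] + sum(W[v][k] * x[v] for v in range(V)), 1.0)
         for k in range(K)]
    # x_v | z ~ Bernoulli(sigmoid(theta_b[v] + (W z)_v)), as in (18)
    x_new = []
    for v in range(V):
        activation = theta_b[v] + sum(W[v][k] * z[k] for k in range(K))
        x_new.append(1 if rng.random() < sigmoid(activation) else 0)
    return z, x_new
```

Starting the chain at a training vector and running one or a few such sweeps yields the kind of short-chain samples used in contrastive divergence style approximations of the model expectation.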

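In an exponential family model with exponential function $F(\mathbf{x}; \theta)$, it is easy to verify that the gradient of the log marginal likelihood $\mathcal{L}$ of the observed data $\mathbf{x}$ can be expressed as

$$\frac{\partial \mathcal{L}(\theta; \mathbf{x})}{\partial \theta} = N \Big( \Big\langle \frac{\partial F(\mathbf{x}; \theta)}{\partial \theta} \Big\rangle_{\tilde{P}(\mathbf{x})} - \Big\langle \frac{\partial F(\mathbf{x}; \theta)}{\partial \theta} \Big\rangle_{P(\mathbf{x}; \theta)} \Big), \qquad (21)$$

where $\langle \cdot \rangle_{\tilde{P}(\mathbf{x})}$ denotes the expectation under the empirical distribution, $\langle \cdot \rangle_{P(\mathbf{x}; \theta)}$ is an expectation under the model's marginal distribution and $N$ is the number of data elements. We can thus compute the gradient of the log-likelihood with respect to the weight matrix $W$ using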
where $N_d$ is the number of vectors of observed data, $\mathbf{x}^{(j)}$ are samples indexed by $j$ and $N_s$ is the number of samples used per data vector, computed using Gibbs sampling with conditionals (17), (18) and (19). In our experiments here we have found it possible to use either one or a small number of Markov Chain Monte Carlo (MCMC) (Andrieu et al., 2003) steps initialized from the data vector (the contrastive divergence approach (Hinton, 2002)). Standard MCMC approximations for expectations are also possible. We use straightforward gradient-based optimization for model parameters with a learning rate and a momentum term. Finally, for conditional likelihood and multi-conditional likelihood based learning, gradient values can be obtained from


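$$\frac{\partial \mathcal{L}_{MC}}{\partial \theta} = N \Big[ (\alpha + \beta) \Big\langle \frac{\partial F(\mathbf{x}_b, x_d; \theta)}{\partial \theta} \Big\rangle_{\tilde{P}(\mathbf{x}_b, x_d)} - \alpha \Big\langle \frac{\partial F(\mathbf{x}_b, x_d; \theta)}{\partial \theta} \Big\rangle_{P(x_d|\mathbf{x}_b; \theta) \tilde{P}(\mathbf{x}_b)} - \beta \Big\langle \frac{\partial F(\mathbf{x}_b, x_d; \theta)}{\partial \theta} \Big\rangle_{P(\mathbf{x}_b|x_d; \theta) \tilde{P}(x_d)} \Big].$$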


Relationships  to  Other  Work 

Theoretical and empirical results in Ng and Jordan (2002) have supported the notion that, while a discriminative model may have a lower asymptotic error (with more data), the error rate of classifications based on an analogous generative model can often approach its asymptotically higher error rate faster. Hybrid methods combining generative and discriminative methods are appealing in that they have the potential to draw upon the strengths of both approaches. For example, in Raina et al. (2003), a high dimensional subset of parameters is trained under a joint likelihood objective while another smaller subset of parameters is trained under a conditional likelihood objective. In contrast, in our approach all parameters are optimized under a number of conditional objectives.

In Corduneanu and Jaakkola (2003), a method characterized as information regularization is formulated for using information about the marginal density of unlabeled data to constrain an otherwise free conditional distribution. Their approach can be thought of as a method for penalizing decision boundaries that occur in areas of high marginal density. In terms of the regularization perspective, our multi-conditional approach uses additional or auxiliary conditional distributions derived from an underlying joint probability model as regularizers. Furthermore, our approach is defined within the context of an underlying joint model. It is our belief that these additional conditional distributions in our objective function can serve as a regularizer for the conditional distributions we primarily care about, the probability of labels. As such, we weight the conditional distributions differently in our objective.

With equal weighting of conditionals and an appropriate definition of subsets of variables, the method can be seen as a type of pseudo-likelihood (Besag, 1975). However, our goals are quite different, in that we are not trying to approximate a joint likelihood, but rather, we wish to explicitly optimize for the conditional distributions in our objective.

The mixtures of naive MRFs we present resemble the multiple mixture components per class approach used in Nigam et al. (2000). The conditional distributions arising for our labels given our data are also related to mixtures of experts (Jordan & Jacobs, 1994), conditional mixture models (Jebara & Pentland, 1998), simple mixtures of maximum entropy models (Pavlov et al., 2002), and mixtures of conditional random fields (McCallum et al., 2005; Quattoni et al., 2004). The continuous latent variable model we present here is similar to the dual wing harmonium or two layer random field presented in Xing et al. (2005) for mining text and images. In that approach a lower dimensional representation of image and text data is obtained by optimizing the joint likelihood of a harmonium model.

Experimental  Results 

In this section, we present experimental results using multi-conditional objective functions in the context of the models described. First, we apply naive Markov random fields to document classification and show that the multi-conditional training provides better regularization than the traditional Gaussian prior. Next, we demonstrate mixture forms of the model on both real and synthetic data, including an example of topic discovery. Finally, we show that in harmonium-structured models, the multi-conditional objective provides a quantitatively better latent space.

Naive  MRFs  and  MCL  as  Regularization 

We use the objective function $\alpha \mathcal{L}_{y|x}(\theta) + \beta \mathcal{L}_{x|y}(\theta)$ in naive MRFs and compare to the generative naive Bayes model and the discriminative maximum entropy model for document classification. We present extensive experiments with common text data sets, which are briefly described below.

•  20 Newsgroups is a corpus of approximately 20,000 newsgroup messages. We use the entire corpus (abbreviated as news), as well as two subsets (talk and comp).

•  The industry sector corpus is a collection of corporate webpages split into about 70 categories. We use the entire corpus (sector), as well as three subsets: healthcare (health), financial (finan), and technology (tech).

•  The movie review corpus (movie) is a collection of user movie reviews from the Internet Movie Database, compiled by Bo Pang at Cornell University. We used the polarity data set (v2.0), where the task is to classify the sentiment of each review as positive or negative.

•  The sraa data set consists of 73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation.

•  The Web Knowledge Base (webkb) data set consists of webpages from four universities that are classified into faculty, student, course, and project (we discard the categories of staff, department, and other).

We determine $\alpha$ and $\beta$, the weights of each component of our objective function, and the Gaussian prior variance $\sigma^2$ using cross validation. Specifically, we use 10-fold cross-validation, with 5 folds used for choosing these parameters and 5 folds used for testing. The models tend to be quite sensitive to the values of $\alpha$ and $\beta$. Additionally, because there is no longer a guarantee of convexity, thoughtful initialization of parameters is sometimes required. In future work, we hope to more thoroughly understand and control for these engineering issues.

During preprocessing, we remove words that occur only once in each corpus, as well as stopwords, HTML, and email message headers. We also test with small-vocabulary versions of each data set in which the vocabulary size is reduced to 2000 using information gain.
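
Vocabulary reduction by information gain can be sketched as follows. The text does not spell out the exact variant used, so this example assumes the common presence/absence definition, in which a word is scored by how much knowing whether it occurs in a document reduces the entropy of the class label. Function names and the set-of-words document representation are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """H(Y) - H(Y | word present or absent) for one vocabulary word.

    docs is a list of sets of words; labels is the parallel class list.
    """
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_vocabulary(docs, labels, size):
    """Keep the `size` words with the highest information gain."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:size]
```

With `size=2000`, `select_vocabulary` would produce a reduced vocabulary of the kind described above; words that appear in every document, or equally often in every class, score zero and are dropped first.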

The results are presented in Table 1. The parenthesized values are the standard deviations of the test accuracy across the cross validation folds. On 15 of 20 data sets, we show improvements over both maximum entropy and naive Bayes. Although the differences in accuracy are small in some cases, the overall trend across data sets illustrates the potential of MCL for regularization. In fact, the difference between the mean accuracy for maximum entropy and MCL is larger than the difference between the mean accuracies of naive Bayes and maximum entropy. Across all data sets, the mean MCL accuracy is significantly greater than the mean accuracies of naive Bayes (p = 0.001) and maximum entropy (p = 0.0002) under a one-tailed paired t-test.
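
The summary statistics quoted in this paragraph can be recomputed from the accuracies in Table 1. The short script below is a sketch using only the Python standard library. The three lists are transcribed from Table 1 in row order, and the helper `paired_t` is a hypothetical name. It computes the paired t statistic only; converting that statistic to a p-value requires a Student t distribution with 19 degrees of freedom, which is not done here.

```python
import math
from statistics import mean, stdev

# Accuracies transcribed from Table 1, in row order:
# news, news (2000), comp, comp (2000), ..., webkb, webkb (2000).
nb = [85.3, 76.4, 85.1, 81.8, 84.6, 83.7, 75.6, 73.9, 91.0, 92.9,
      92.3, 87.3, 93.5, 95.0, 78.6, 90.9, 95.9, 93.7, 87.9, 84.7]
maxent = [82.9, 77.4, 83.7, 82.2, 82.3, 81.6, 88.0, 82.0, 91.8, 91.4,
          89.2, 89.6, 94.0, 91.0, 82.6, 88.8, 96.1, 94.7, 92.4, 92.4]
mcl = [85.9, 77.7, 83.4, 84.0, 83.7, 84.3, 87.4, 83.2, 93.1, 94.5,
       91.5, 94.6, 95.5, 95.5, 82.7, 94.0, 96.7, 95.0, 92.4, 92.7]


def paired_t(a, b):
    """Paired t statistic for the per-data-set differences a - b."""
    diffs = [u - v for u, v in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Data sets on which MCL strictly beats both baselines.
wins = sum(1 for n, e, m in zip(nb, maxent, mcl) if m > n and m > e)

print("MCL beats both baselines on", wins, "of", len(mcl), "data sets")
print("mean accuracies:", mean(nb), mean(maxent), mean(mcl))
print("t (MCL vs naive Bayes):", paired_t(mcl, nb))
print("t (MCL vs MaxEnt):", paired_t(mcl, maxent))
```

Counting strict wins reproduces the 15 of 20 figure, and the column means agree with the last row of Table 1 after rounding to one decimal place.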

We  also  found  that  in  10  of  15  data  sets  on  which  we  also 
calculated  the  area  under  the  accuracy/coverage  curve,  MCL 
provided  better  confidence  estimates. 

Mixtures  of  Naive  MRFs 

In order to demonstrate the ability of multi-conditional mixtures to successfully classify data that is not linearly separable, we perform the following synthetic data experiments. Four class labels are each associated with four 4-dimensional Gaussians, having means and variances uniformly sampled between 0 and 100. Positions of data points generated from the Gaussians are rounded to integer values. For some samples of the Gaussian means and variances (e.g. an XOR configuration), a significant portion of the data would be misclassified by the best linear separator. MCMs, however, can learn and combine multiple linear decision boundaries. An MCM with two hidden subclasses per class attains an accuracy of 75%, whereas naive Bayes, maximum entropy, and non-mixture multi-conditional naive MRFs have accuracies of 54%, 52%, and 56%, respectively. With explicitly-constructed XOR positioning, MCM attains 99%, while the others yield less than 50%.

Running these MCMs on the talk data set yields “topics” similar to latent Dirichlet allocation (LDA) (Blei et al., 2003), except that parameter estimation is driven to discover topics that not only re-generate the words, but also help predict the class label (thus MCM can also be understood as a “semi-supervised” topic model). Furthermore, MCM topics are defined not only by positive word associations, but also by prominent negative word associations. The words with most positive and negative $\theta_{x,z}$ are shown in Table 2.

Data             Naive Bayes    MaxEnt         MCL
news             85.3 (0.61)    82.9 (0.82)    85.9 (0.89)
news (2000)      76.4 (0.88)    77.4 (0.81)    77.7 (0.48)
comp             85.1 (1.78)    83.7 (0.68)    83.4 (0.94)
comp (2000)      81.8 (1.36)    82.2 (0.75)    84.0 (1.05)
talk             84.6 (1.02)    82.3 (1.43)    83.7 (1.27)
talk (2000)      83.7 (2.17)    81.6 (2.27)    84.3 (1.21)
sector           75.6 (2.05)    88.0 (1.13)    87.4 (0.84)
sector (2000)    73.9 (0.78)    82.0 (1.03)    83.2 (1.56)
tech             91.0 (1.33)    91.8 (2.24)    93.1 (1.69)
tech (2000)      92.9 (2.46)    91.4 (2.03)    94.5 (1.81)
finan            92.3 (2.36)    89.2 (1.52)    91.5 (2.57)
finan (2000)     87.3 (3.31)    89.6 (1.82)    94.6 (1.79)
health           93.5 (4.36)    94.0 (3.74)    95.5 (4.00)
health (2000)    95.0 (5.00)    91.0 (3.39)    95.5 (4.30)
movie            78.6 (1.20)    82.6 (2.96)    82.7 (2.50)
movie (2000)     90.9 (1.98)    88.8 (1.96)    94.0 (1.05)
sraa             95.9 (0.15)    96.1 (0.23)    96.7 (0.09)
sraa (2000)      93.7 (0.20)    94.7 (0.13)    95.0 (0.21)
webkb            87.9 (2.14)    92.4 (0.84)    92.4 (1.04)
webkb (2000)     84.7 (1.20)    92.4 (1.07)    92.7 (1.40)
mean             86.5 (6.73)    87.7 (5.39)    89.4 (5.76)

Table 1: Document classification accuracies for naive Bayes, maximum entropy, and MCL.

Lower-variance  Conditional  Mixture  Estimation 

Consider data generated from two classes, each with four sub-classes drawn from 2-D isotropic Gaussians (similar to the example in Jebara and Pentland (2000)). The data are illustrated by red o’s and blue x’s in Figure 3. Using joint, conditional, and multi-conditional likelihood, we fit mixture models with two (diagonal covariance, i.e. naive) subclasses using conditional expected gradient optimization (Salakhutdinov et al., 2003). The figure depicts the parameters of the best models found under our objectives using ellipses for constant probability under the model.

From this illustrative example, we see that the parameters estimated by joint likelihood would completely fail to classify o versus x given location. In contrast, the conditional objective focuses completely on the decision boundary; however, in 30 random initializations, this produced parameters with very high variance and little interpretability. Our multi-conditional objective, however, optimizes for both class label prediction and class-conditioned density, yielding good classification accuracy and sensible, low-variance parameter estimates.

Multi-Conditional  Harmoniums 

We are interested in the quality of the latent representations
obtained when optimizing multi-attribute harmonium-structured
models under standard (joint) maximum likelihood (ML),
conditional likelihood (CL), and multi-conditional likelihood
(MCL) objectives. We use a testing strategy similar to that of
Welling et al. (2005) but focus on comparing the different
latent spaces obtained with the various optimization
objectives. As in Welling et al. (2005), we used the


Topic 1 (gun control)      Topic 2 (Waco incident)
guns          1.27         nra           1.63
texas         1.19         assault       1.52
gun           1.18         waco          1.21
enforcement   1.14         compound      1.19
president    -0.83         employer     -0.90
peace        -0.85         cult         -0.94
years        -0.88         terrorists   -1.02
feds         -1.17         matthew      -1.15

Table 2: Two MCM-discovered "topics" associated with the
politics.guns label in a run on the talk data set. On the
left, discussion about gun control in Texas. The negatively
weighted words are prominent in other classes, including
politics.misc. On the right, discussion about the gun rights
of David Koresh when federal agents stormed their compound in
Waco, TX. Aspects of the Davidian cult, however, were discussed
in religion.misc.

Figure 3: (Left) Joint likelihood optimization. (Middle) One of
the many near-optimal solutions found by conditional likelihood
optimization. (Right) An optimal solution found by our
multi-conditional objective.


reduced 20 newsgroups data set prepared in MATLAB by
Sam Roweis. In this data set, 16,242 documents are represented
by binary word occurrences over a 100-word vocabulary and are
labeled as one of four domains.

To evaluate the quality of our latent space, we retrieve
documents that have the same domain label as a test document
based on their cosine coefficient in the latent space when
observing only binary occurrences. We randomly split the data
into a training set of 12,000 documents and a test set of
4,242 documents. We use a joint model with a corresponding
full-rank multivariate Bernoulli conditional for binary word
occurrences and a discrete conditional for domains. Figure 4
shows the precision-recall results. ML-1 is our model with
no domain label information. ML-2 is optimized with domain
label information. CL is optimized to predict domains from
words, and MCL is optimized to predict both words from domains
and domains from words. From Figure 4 we see that the latent
space captured by the model is more relevant for domain
classification when the model is optimized under the CL and
MCL objectives. MCL more than doubles precision at reasonable
values of recall, and recall at reasonable values of precision.
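
The retrieval evaluation described above can be sketched as follows. This is a hedged illustration rather than the evaluation code behind Figure 4: it takes the latent vectors as given (however the model produced them), assumes the retrieval pool is a separate set of documents ranked for each test query, and averages precision and recall over queries at every rank cutoff. The function names and arguments are hypothetical.

```python
import numpy as np

def cosine_similarity(queries, docs, eps=1e-12):
    """Pairwise cosine similarity between rows of queries and rows of docs."""
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + eps)
    d = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + eps)
    return q @ d.T

def precision_recall_curve(query_latent, query_labels, doc_latent, doc_labels):
    """Mean precision and recall over queries at each rank cutoff 1..N."""
    sims = cosine_similarity(query_latent, doc_latent)
    order = np.argsort(-sims, axis=1)                      # most similar first
    # relevant[q, r] is True when the r-th ranked document shares q's label.
    relevant = doc_labels[order] == query_labels[:, None]
    hits = np.cumsum(relevant, axis=1)
    ranks = np.arange(1, doc_latent.shape[0] + 1)
    precision = hits / ranks
    # Guard against queries with no relevant documents in the pool.
    recall = hits / np.maximum(relevant.sum(axis=1, keepdims=True), 1)
    return precision.mean(axis=0), recall.mean(axis=0)
```

With four equally frequent domain labels, a latent space carrying no label information would yield precision near 0.25 at every cutoff, which is the random-guessing baseline mentioned in the Figure 4 caption.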

Discussion  and  Conclusions 

We have presented multi-conditional learning in the context
of naive MRFs, mixtures of naive MRFs, and harmonium-structured
models. For naive MRFs, we show that multi-conditional learning
provides improved regularization, and flexible, robust
mixtures. In the context of harmonium-structured models, our
experiments show that multi-conditional
contrastive-divergence-based optimization procedures can lead
to latent document spaces with superior quality.

Figure 4: Precision-recall curves for the "20newsgroups" data
using ML, CL and MCL with 20 latent variables. Random guessing
is a horizontal line at 0.25.

Multi-conditional learning is well suited for multi-task
and semi-supervised learning, since multiple prediction
tasks are easily and naturally defined in the MCL framework.
In recent work by Ando and Zhang (2005), semi-supervised and
multi-task learning methods are combined. Their approach
involves auxiliary prediction problems defined for unlabeled
data such that model structures arising from these tasks are
also useful for another classification problem of particular
interest. It involves finding the principal components of the
parameter space for auxiliary tasks. One can similarly use the
MCL approach to define auxiliary conditional distributions
among features. In this way MCL is a natural framework for
semi-supervised learning. We are presently exploring MCL in
these multi-task and semi-supervised settings.